Finding and removing duplicate records

Problem

You want to find and/or remove duplicate entries from a vector or data frame.

Solution

With vectors:

# Generate a vector 
set.seed(158)
x <- round(rnorm(20, 10, 5))
# 14 11  8  4 12  5 10 10  3  3 11  6  0 16  8 10  8  5  6  6

# For each element: is it a duplicate? (The first instance of each value is not counted as one.)
duplicated(x)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
#[13] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

# The values of the duplicated entries
# Note that 6, 8, and 10 each appear three times in the original vector, so each has two entries here
x[duplicated(x)]
# [1] 10  3 11  8 10  8  5  6  6

# Duplicated entries, without repeats
unique(x[duplicated(x)])
# 10  3 11  8  5  6

# The original vector with all duplicates removed. These do the same:
unique(x)
x[!duplicated(x)]
# 14 11  8  4 12  5 10  3  6  0 16
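Because duplicated() does not flag the first instance of a value, it cannot by itself return every element that has a twin. One possible approach, sketched below, combines a forward and a backward pass using the fromLast argument. The vector is typed in by hand from the values shown above so that the example does not depend on the random number generator, and the name all_dups is only illustrative.

```r
# The same values as the vector generated above, entered directly
x <- c(14, 11, 8, 4, 12, 5, 10, 10, 3, 3, 11, 6, 0, 16, 8, 10, 8, 5, 6, 6)

# Scan from the end, so the LAST instance of each value is the one not counted
duplicated(x, fromLast = TRUE)

# TRUE for every element whose value occurs more than once, first instance included
all_dups <- duplicated(x) | duplicated(x, fromLast = TRUE)
x[all_dups]
# 11  8  5 10 10  3  3 11  6  8 10  8  5  6  6

# Values that occur exactly once
x[!all_dups]
# 14  4 12  0 16
```

Note that x[!all_dups] is not the same as unique(x): it drops every value that is repeated, rather than keeping one copy of each.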

With data frames:

# A sample data frame:
df <- read.table(header=TRUE, text='
 label value
     A     4
     B     3
     C     6
     B     3
     B     1
     A     2
     A     4
     A     4
')

# Is each row a repeat?
duplicated(df)
# FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

# Show the repeat entries
df[duplicated(df),]
# label value
#     B     3
#     A     4
#     A     4

# Show unique repeat entries 
unique(df[duplicated(df),])
# label value
#     B     3
#     A     4

# Original data with repeats removed. These do the same:
unique(df)
df[!duplicated(df),]
# label value
#     A     4
#     B     3
#     C     6
#     B     1
#     A     2
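The examples above compare entire rows. If only some columns should decide what counts as a duplicate, duplicated() can be applied to just those columns and the result used to index the full data frame. The following is a minimal sketch under that assumption, using the same sample data; the name first_per_label is illustrative.

```r
# The same sample data frame as above
df <- read.table(header = TRUE, text = '
 label value
     A     4
     B     3
     C     6
     B     3
     B     1
     A     2
     A     4
     A     4
')

# Is each row's label a repeat of an earlier label? (value is ignored)
duplicated(df$label)
# FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# Keep only the first row seen for each label
first_per_label <- df[!duplicated(df$label), ]
first_per_label
#   label value
# 1     A     4
# 2     B     3
# 3     C     6
```

To compare on several columns, pass a sub-data-frame instead, for example duplicated(df[, c("label", "value")]), which for this two-column data frame is the same as duplicated(df).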